In [3]:
from __future__ import division  # must come before other statements: true division on Python 2

import glob
import os

import numpy as np
import pandas as pd
import sqlalchemy as sqla
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
Data comes from two sources: trip records from the NYC Taxi & Limousine Commission (TLC), from which I compute per-tract dropoff counts, and demographic data from the US Census Bureau (the 2010 census and the 2007-2011 American Community Survey).
My goal will be to predict the number of taxi dropoffs in an area based on demographic and other stats about the area. Ideally, I would like to understand the demographics of taxicab users and how well taxicabs serve NYC residents. I may later expand my data to look at green "boro cabs" (which are only allowed to pick up passengers outside of the Manhattan central business district; data is also available from the TLC) and Uber cars (https://github.com/fivethirtyeight/uber-tlc-foil-response) to compare their usage demographics.
If I do something with the Uber data, I plan to write up a story in the style of fivethirtyeight.com that I may submit to them, as they have done a series of stories on this Uber data.
I would also expect that a paper comparing taxi services would be useful for the city government, which wants to ensure that there is good coverage throughout the city and across demographic groups.
I'm only using census tracts with a population of at least 1000. Most of the tracts with population below that are parks or ocean, and the errors associated with data on these are generally very high. Also, taxi dropoffs are probably not representative of the resident population for those areas.
In [4]:
engine = sqla.create_engine('postgresql://mydatabase:notyours@localhost:5432/TaxiData', echo=False)
# Keep only census tracts with population >= 1000 (see above)
full = pd.read_sql_query('SELECT a.* FROM lotsofdata AS a WHERE a.totalpopulation>=1000',
                         engine).set_index('fipscodes')
For this analysis, I'll be looking at data from three columns of the ACS, with per-capita dropoffs as my dependent variable, i.e. the dropoffs computed above divided by the population given by the census.
This uses the 2010 census (the most recent) and the 2007-2011 5-year American Community Survey (ACS) data. I figure the ACS data splits the difference between my 2010 population data and the 2013 drop-off data.
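In symbols, the dependent variable for each tract is
$$\text{dropoffs per capita} = \frac{\text{2013 yellow cab dropoffs in the tract}}{\text{2010 census population of the tract}}.$$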
The census uses some funky abbreviations for their column headers, and I haven't changed them. Here's a description: MOGE001 is the total number of commuters (workers reporting a means of transportation to work), MOGE011 is the number who commute by driving, and MRUE001 is per-capita income.
I compute the ratio of non-driving commuters to total commuters, labeled nondrivercommuterrat
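That is, matching the computation in the next cell:
$$\text{nondrivercommuterrat} = \frac{\text{MOGE001} - \text{MOGE011}}{\text{MOGE001}}$$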
In [5]:
subset = full[['MOGE001', 'MOGE011', 'MRUE001', 'abridged2013ycdrpoffpc']].copy()
subset['nondrivercommuterrat'] = (subset['MOGE001'] - subset['MOGE011']) / subset['MOGE001']
subset.replace(np.inf, np.nan, inplace=True)  # tracts reporting zero commuters
subset.dropna(inplace=True)
subset.head()
Out[5]:
In [7]:
# Plotting the data:
# (Optional usetex/siunitx/helvet/sansmath styling via plt.rc is omitted here;
# those settings are global and better added to an rc file, which makes them permanent.)
matplotlib.rc('legend', fontsize=10)
h = 6                      # figure height in inches
gr = (1 + np.sqrt(5)) / 2  # golden ratio for the aspect
plt.figure(figsize=[h * gr, h])
plt.autoscale(tight=True)
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
sc = plt.scatter(subset['MRUE001'], subset['abridged2013ycdrpoffpc'], alpha=0.5,
                 label='Data', c=subset['nondrivercommuterrat'],
                 cmap=plt.cm.get_cmap('viridis'))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.show()
There appears to be a bit of a linear relationship between the logs of the dropoffs per-capita and per-capita income, but it isn't that strong. There is a funky region that almost looks like it has a negative slope, but the coloring shows that this region tends to have fewer non-driver commuters.
In [28]:
comd = 0.8
asub = subset[subset['nondrivercommuterrat'] > comd]
bsub = subset[subset['nondrivercommuterrat'] < comd]
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('cool'))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
If we look only at the data with a high non-driver commuter ratio, the linear relationship is very clear. That funky corner consists entirely of tracts with lower non-driver commuter ratios.
As for the rest of the data, the relationship isn't obvious. At least using just per-capita income as the independent variable, there doesn't seem to be much correlation (see the blob below), and I will need to explore other variables. I might want to use some more advanced machine learning techniques to figure out which variables are useful.
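One direction for that (a sketch only, not something I've run; it reuses the linear_model import from the top, and the column list is a stand-in for whatever extra ACS columns I pull in later): an L1-penalized fit zeroes out uninformative columns, which is a cheap way to shortlist variables.
In [ ]:
# Sketch: cross-validated lasso to shortlist predictors.
# X_cols is hypothetical -- swap in additional ACS columns as they are added.
X_cols = ['MRUE001', 'nondrivercommuterrat']
X = subset[X_cols]
X = (X - X.mean()) / X.std()  # standardize so coefficients are comparable
# assumes strictly positive dropoff counts, as in the fits below
y = np.log(subset['abridged2013ycdrpoffpc'])
lasso = linear_model.LassoCV(cv=5).fit(X, y)
print(dict(zip(X_cols, lasso.coef_)))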
In [9]:
comd = 0.8
asub = subset[subset['nondrivercommuterrat'] < comd]
bsub = subset[subset['nondrivercommuterrat'] > comd]
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $<$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $>$ 0.8')
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
We can, however, run a regression on the high non-driver commuter data:
In [31]:
asub = subset[subset['nondrivercommuterrat'] > comd]
bsub = subset[subset['nondrivercommuterrat'] < comd]
# OLS in log-log space: log(dropoffs per-capita) ~ log(per-capita income)
model = sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(asub['MRUE001']), prepend=False))
fit = model.fit()
fit.summary()
Out[31]:
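Since the regression is in log-log space, $\log y = \beta \log x + c$ (with $y$ the per-capita dropoffs and $x$ per-capita income), the fitted line back in the original units is the power law
$$y = e^{c}\,x^{\beta},$$
with $\beta$ = fit.params['MRUE001'] and $c$ = fit.params['const']. That's the expression plotted in red below.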
In [30]:
comd = 0.8
asub = subset[subset['nondrivercommuterrat'] > comd]
bsub = subset[subset['nondrivercommuterrat'] < comd]
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
# Back-transform of the log-log fit: y = exp(c) * x**beta
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
In [32]:
fig = sm.qqplot(fit.resid, line='45')
plt.show()
In [135]:
hist=fit.resid.hist(alpha=0.8)
I have to admit, I'm not quite sure what to think of these plots. Clearly my data isn't exactly normally distributed, but this is the real world so that's not surprising. Does it deviate from a normal distribution enough that I should be worried? I don't know.
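One way to put a number on that worry: the residuals' skew and kurtosis feed two standard normality tests (both are also reported at the bottom of fit.summary() above). A minimal sketch:
In [ ]:
from scipy import stats
from statsmodels.stats.stattools import jarque_bera

# D'Agostino-Pearson omnibus test: are skew and kurtosis jointly normal-like?
k2, omni_p = stats.normaltest(fit.resid)
# Jarque-Bera: joint test that skew is 0 and kurtosis is 3
jb, jb_p, skew, kurt = jarque_bera(fit.resid)
print('omnibus p = %.3g, JB p = %.3g, skew = %.2f, kurtosis = %.2f'
      % (omni_p, jb_p, skew, kurt))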
I'm going to take a look at the outliers and high leverage points regardless:
In [14]:
from statsmodels.stats import outliers_influence

influence = outliers_influence.OLSInfluence(fit)
infframe = influence.summary_frame()
# Flag points whose studentized residual exceeds 3 in absolute value
outliers = infframe[np.abs(infframe['student_resid']) > 3].index
In [45]:
asub.loc[outliers]
Out[45]:
In [46]:
outliers.tolist()
Out[46]:
In [49]:
from IPython.display import Image
Image(filename='Outliers.png')
Out[49]:
These five outliers consist of:
In [52]:
asub.drop(outliers, inplace=True)  # refit without the five outliers
model = sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(asub['MRUE001']), prepend=False))
fit = model.fit()
fit.summary()
Out[52]:
In [53]:
# asub (now with outliers dropped) and bsub carry over from above.
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
Removing the outliers bumps up the $R^2$ and, more importantly, produces a more representative fit line.
Next I'll look at high leverage points:
In [63]:
# Flag points with leverage more than three times the average hat-matrix diagonal
highleverage = infframe[infframe['hat_diag'] > (infframe['hat_diag'].mean() * 3)].index
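For context on the cutoff: the hat-matrix diagonals of an OLS fit sum to the number of fitted parameters, so with $p$ parameters and $n$ points the average leverage is $p/n$ (here $p = 2$: slope and intercept). The filter above flags tracts with more than three times that average leverage, a common rule of thumb.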
In [77]:
Image(filename='HighL_Outliers.png')
Out[77]:
These are some of the highest-income census tracts in the US (world?), possibly the highest; I feel like I read that somewhere.
Dropping these high leverage points as well as the outliers gives us:
In [64]:
asub.drop(highleverage, inplace=True)  # also drop the high leverage points
model = sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(asub['MRUE001']), prepend=False))
fit = model.fit()
fit.summary()
Out[64]:
The $R^2$ is slightly lower, but the parameters are probably more representative of the data as a whole.
In [65]:
# asub (outliers and high leverage points dropped) and bsub carry over from above.
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
Below is all the data, so that you can see the dropped outliers and high leverage points as well. Only ten of the 682 data points are dropped (about 1.5%), so we are dropping a very small fraction of the data.
In [76]:
# Same plot as above, with the dropped points overlaid in red and orange.
xdata = asub['MRUE001'].values
ydata = asub['abridged2013ycdrpoffpc'].values
XX = np.linspace(xdata.min(), xdata.max(), num=100)
matplotlib.rc('legend', fontsize=10)
h = 6
gr = (1 + np.sqrt(5)) / 2
plt.figure(figsize=[h * gr, h])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
sc = plt.scatter(xdata, ydata, alpha=0.5,
                 label='Data, non-driver commuter ratio $>$ 0.8',
                 c=asub['nondrivercommuterrat'], cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'], bsub['abridged2013ycdrpoffpc'], facecolors='none',
            edgecolors='black', alpha=0.1,
            label='Data, non-driver commuter ratio $<$ 0.8')
plt.scatter(subset.loc[outliers]['MRUE001'], subset.loc[outliers]['abridged2013ycdrpoffpc'],
            alpha=0.25, label='Outliers', color='red')
plt.scatter(subset.loc[highleverage]['MRUE001'], subset.loc[highleverage]['abridged2013ycdrpoffpc'],
            alpha=0.3, label='High leverage', color='orange')
plt.plot(XX, (XX ** fit.params['MRUE001']) * np.exp(fit.params['const']),
         color='red', alpha=0.5, lw=2,
         label='Fit, $\mathsf{R}^\mathsf{2}=$ ' + str(round(fit.rsquared, 2)))
cbar = plt.colorbar(sc)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='lower right')
plt.show()
Now we can see that the remaining data is much closer to normally distributed:
In [71]:
fig = sm.qqplot(fit.resid, line='45')
plt.show()
In [72]:
hist=fit.resid.hist(alpha=0.8)
We've got a fit that explains most of the variance, with an $R^2$ of 0.7. I suspect that I won't be able to further explain this subset (non-driver commuter ratio > 0.8) of the data, but maybe I will.
However, there is still lots to do. I should probably do an error analysis on this data, to see whether there is really any more variance that could be explained (i.e. much of the remaining variance might be due to random fluctuations). I also might want to do some analysis to better understand the causal relationship between these two variables. Many of these dropoffs are not representative of the resident population but are instead common destinations, e.g. Grand Central and Penn Stations, or the area around Central Park. In that case, the common destination causes high rents and thus high incomes, so it can't be concluded that the rich are more likely to take cabs.
Also, to really do this analysis properly, I shouldn't be fitting on all the data and should do some sampling, fitting on one subset and testing on the rest. I figure I can get to that once I've got a model that explains the rest of the data as well.
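As a first pass at that, a minimal holdout sketch (train_test_split lives in sklearn.model_selection on recent scikit-learn, sklearn.cross_validation on older versions):
In [ ]:
from sklearn.model_selection import train_test_split

# Fit on 70% of the high non-driver-commuter tracts, evaluate on the rest
train, test = train_test_split(asub, test_size=0.3, random_state=0)
model = sm.OLS(np.log(train['abridged2013ycdrpoffpc']),
               sm.add_constant(np.log(train['MRUE001']), prepend=False))
holdout_fit = model.fit()
pred = holdout_fit.predict(sm.add_constant(np.log(test['MRUE001']), prepend=False))
resid = np.log(test['abridged2013ycdrpoffpc']) - pred
print('held-out residual std: %.3f' % resid.std())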
In [ ]: